As college students who began our undergraduate studies in the midst of a pandemic, we witnessed the rise of many new businesses and services. During the pandemic, people were encouraged to stay indoors and avoid public gatherings, so services that let consumers obtain what they needed without leaving the comfort of their homes naturally rose in popularity. Uber Eats is a service which falls into this category. It allows people to order food from home by simply using a smartphone to choose a restaurant, enter an address, and pay the fee for the food and delivery. An Uber Eats driver then drives to the restaurant, picks up the food, and delivers it to the customer's front door. The consumer can enjoy a full meal without having to cook or leave home, which many people considered the safer option during the COVID-19 outbreak.
Today, as society works toward a post-pandemic world, services like Uber Eats have proven resilient. People still prefer the convenience of not having to leave their homes for food; now, the motivation can be anything from lacking transportation options to simply not wanting to make a meal after a long day. Having food delivered to your doorstep will most likely remain in demand for a while.
For our final CMSC320 project, we decided to explore and analyze Uber Eats data to examine the relationships among different attributes of restaurants and the consumers who choose to give them their business.
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
import pandas as pd
import re
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import folium
from folium import plugins
from folium.plugins import HeatMap
from sklearn import linear_model
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from scipy import stats
from sklearn.model_selection import train_test_split, KFold, cross_val_score
We started off using the raw data provided by “Uber Eats Exploration” on the platform Kaggle. This dataset provides us with information about various restaurants that use the platform.
df = pd.read_csv('/content/drive/MyDrive/restaurants.csv')
df
| id | position | name | score | ratings | category | price_range | full_address | zip_code | lat | lng | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 19 | PJ Fresh (224 Daniel Payne Drive) | NaN | NaN | Burgers, American, Sandwiches | $ | 224 Daniel Payne Drive, Birmingham, AL, 35207 | 35207 | 33.562365 | -86.830703 |
| 1 | 2 | 9 | J' ti`'z Smoothie-N-Coffee Bar | NaN | NaN | Coffee and Tea, Breakfast and Brunch, Bubble Tea | NaN | 1521 Pinson Valley Parkway, Birmingham, AL, 35217 | 35217 | 33.583640 | -86.773330 |
| 2 | 3 | 6 | Philly Fresh Cheesesteaks (541-B Graymont Ave) | NaN | NaN | American, Cheesesteak, Sandwiches, Alcohol | $ | 541-B Graymont Ave, Birmingham, AL, 35204 | 35204 | 33.509800 | -86.854640 |
| 3 | 4 | 17 | Papa Murphy's (1580 Montgomery Highway) | NaN | NaN | Pizza | $ | 1580 Montgomery Highway, Hoover, AL, 35226 | 35226 | 33.404439 | -86.806614 |
| 4 | 5 | 162 | Nelson Brothers Cafe (17th St N) | 4.7 | 22.0 | Breakfast and Brunch, Burgers, Sandwiches | NaN | 314 17th St N, Birmingham, AL, 35203 | 35203 | 33.514730 | -86.811700 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 40222 | 40223 | 54 | Mangia la pasta! (5610 N Interstate Hwy 35) | 4.8 | 500.0 | Pasta, Comfort Food, Italian, Group Friendly | $ | 5610 N I35, Austin, TX, 78751 | 78751 | 30.316248 | -97.708441 |
| 40223 | 40224 | 53 | Wholly Cow Burgers (S Lamar) | 4.6 | 245.0 | American, Burgers, Breakfast and Brunch, Aller... | $ | 3010 S Lamar Blvd, Austin, TX, 78704 | 78704 | 30.242816 | -97.783821 |
| 40224 | 40225 | 52 | EurAsia Ramen 3 | 4.7 | 293.0 | Sushi, Asian, Japanese, Exclusive to Eats, Gro... | $ | 5222 Burnet Road, Austin, TX, 78756 | 78756 | 30.324290 | -97.740200 |
| 40225 | 40226 | 51 | Austin's Habibi (5th St) | 4.7 | 208.0 | Mediterranean, Gluten Free Friendly, Allergy F... | $$ | 817 W 5th St, Austin, TX, 78703 | 78703 | 30.269580 | -97.753110 |
| 40226 | 40227 | 50 | Beijing Wok | 4.4 | 254.0 | Chinese, Asian, Asian Fusion, Family Friendly,... | $ | 8106 Brodie Ln, Austin, TX, 78749 | 78749 | 30.202210 | -97.838689 |
40227 rows × 11 columns
For each entry, the dataset stores a unique id, the restaurant's position in the search list, its name, rating, number of ratings, category tags, price range, full address, zip code, and latitude and longitude. However, some of these columns (mainly latitude and longitude) store missing values as 0 instead of null, so even though every entry appears to have a latitude and longitude, some rows are actually missing those fields.
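One way to make those hidden gaps visible is to recode the 0 sentinel as NaN before counting. A minimal sketch on a tiny synthetic frame (the values are hypothetical; the real notebook would apply the same `replace` to `df`):

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame mimicking the lat/lng columns (hypothetical values).
coords = pd.DataFrame({"lat": [33.56, 0.0, 30.32], "lng": [-86.83, 0.0, -97.74]})

# Recode the 0 sentinel as NaN so count()/isna() reflect the true gaps.
coords[["lat", "lng"]] = coords[["lat", "lng"]].replace(0.0, np.nan)
missing = int(coords["lat"].isna().sum())
print(missing)  # prints 1: one of the three sample rows had lat stored as 0
```

After this recoding, `df.count()` would report the true number of non-missing coordinates.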
df.count()
id               40227
position         40227
name             40227
score            22254
ratings          22254
category         40204
price_range      33581
full_address     39949
zip_code         39940
lat              40227
lng              40227
dtype: int64
We then decided to rename the columns “score” and “ratings” to “rating” and “number_of_ratings” to improve readability and to avoid potential confusion between the two columns.
# rename the columns so that the column names are more clear
df.rename(columns={'score': 'rating', 'ratings': 'number_of_ratings'}, inplace=True)
df.head()
| id | position | name | rating | number_of_ratings | category | price_range | full_address | zip_code | lat | lng | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 19 | PJ Fresh (224 Daniel Payne Drive) | NaN | NaN | Burgers, American, Sandwiches | $ | 224 Daniel Payne Drive, Birmingham, AL, 35207 | 35207 | 33.562365 | -86.830703 |
| 1 | 2 | 9 | J' ti`'z Smoothie-N-Coffee Bar | NaN | NaN | Coffee and Tea, Breakfast and Brunch, Bubble Tea | NaN | 1521 Pinson Valley Parkway, Birmingham, AL, 35217 | 35217 | 33.583640 | -86.773330 |
| 2 | 3 | 6 | Philly Fresh Cheesesteaks (541-B Graymont Ave) | NaN | NaN | American, Cheesesteak, Sandwiches, Alcohol | $ | 541-B Graymont Ave, Birmingham, AL, 35204 | 35204 | 33.509800 | -86.854640 |
| 3 | 4 | 17 | Papa Murphy's (1580 Montgomery Highway) | NaN | NaN | Pizza | $ | 1580 Montgomery Highway, Hoover, AL, 35226 | 35226 | 33.404439 | -86.806614 |
| 4 | 5 | 162 | Nelson Brothers Cafe (17th St N) | 4.7 | 22.0 | Breakfast and Brunch, Burgers, Sandwiches | NaN | 314 17th St N, Birmingham, AL, 35203 | 35203 | 33.514730 | -86.811700 |
To analyze the data using location as a variable, we decided to create a new column called “State” that would contain each entry’s state stored as its two-letter abbreviation. To do this, we applied a regular expression to each entry’s “full_address” to pull the two-letter abbreviation if it was present in the address. If it was not, the State value was set to None.
# making new column with state data
lst = []
for a in df["full_address"]:
    result = re.search(r'^(([^,]+,)+) ([A-Z]{2}),', str(a))
    result2 = re.search(r'^(([^,]+,)+)(,|\s,) ([A-Z]{2})(,|)', str(a))
    if result:
        lst.append(result.group(3))
    elif result2:
        lst.append(result2.group(4))
    else:
        lst.append(None)
df["State"] = lst
# list unique states in the dataset
df["State"].unique()
array(['AL', None, 'WY', 'WI', 'MN', 'IL', 'WV', 'OH', 'WA', 'OR', 'ID',
'VA', 'DC', 'MD', 'TN', 'VT', 'UT', 'PR', 'TX'], dtype=object)
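A vectorized alternative to the explicit loop is pandas `str.extract`, which applies one capture group across the whole column. A sketch with made-up addresses (the pattern assumes the state code sits between commas, followed by a zip or the end of the string):

```python
import pandas as pd

addresses = pd.Series([
    "224 Daniel Payne Drive, Birmingham, AL, 35207",
    "817 W 5th St, Austin, TX, 78703",
    "no state in this one",
])
# Capture a two-letter uppercase code that sits after a comma and is
# followed by another comma or the end of the string.
states = addresses.str.extract(r",\s*([A-Z]{2})\s*(?:,|$)", expand=False)
```

Rows with no match come back as NaN, mirroring the None values appended by the loop above.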
We then tried to fill in these missing entries by using the zip codes given in the dataset to find the location’s state. To do this, we downloaded a zip code dataset containing location information for zip codes in the United States.
# load in zip code data set
zipcodes_df = pd.read_csv('/content/drive/MyDrive/free-zipcode-database-Primary.csv')
zipcodes_df.head()
| Zipcode | ZipCodeType | City | State | LocationType | Lat | Long | Location | Decommisioned | TaxReturnsFiled | EstimatedPopulation | TotalWages | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 705 | STANDARD | AIBONITO | PR | PRIMARY | 18.14 | -66.26 | NA-US-PR-AIBONITO | False | NaN | NaN | NaN |
| 1 | 610 | STANDARD | ANASCO | PR | PRIMARY | 18.28 | -67.14 | NA-US-PR-ANASCO | False | NaN | NaN | NaN |
| 2 | 611 | PO BOX | ANGELES | PR | PRIMARY | 18.28 | -66.79 | NA-US-PR-ANGELES | False | NaN | NaN | NaN |
| 3 | 612 | STANDARD | ARECIBO | PR | PRIMARY | 18.45 | -66.73 | NA-US-PR-ARECIBO | False | NaN | NaN | NaN |
| 4 | 601 | STANDARD | ADJUNTAS | PR | PRIMARY | 18.16 | -66.72 | NA-US-PR-ADJUNTAS | False | NaN | NaN | NaN |
We then parsed through it to find state data for any entry with a zip code, storing this in a column called “state_by_zip”.
# initialize list of state values, one per row of the dataframe
row = []
# iterate through the zip_code column in the Uber Eats dataframe
for zips in df['zip_code']:
    # make sure zip code is only 5 digits
    zip = str(zips)[0:5]
    # check for non-numeric values
    if zip.isnumeric():
        # get the row where the zip codes match and pull its State data
        l = (zipcodes_df.loc[zipcodes_df['Zipcode'] == int(zip)])["State"]
        # if l has information and is not empty
        if len((pd.Series(list(l)).values)) > 0:
            # add state data to row
            row.append((pd.Series(list(l)).values)[0])
        # else add nan to row
        else:
            row.append(np.nan)
    # handles non-numeric values
    else:
        row.append(np.nan)
# check that the length of row matches the dataframe (len = 40227)
len(row)
<ipython-input-8-ea2c5bf95b98>:15: DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning. if len((pd.Series(list(l)).values)) > 0:
40227
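The row-by-row `.loc` lookup works, but it scans the zip code table once per restaurant. Building the zip-to-state mapping once and applying it with `map` gives the same result in one vectorized pass. A sketch on miniature stand-in frames (the column names match the real datasets; the values are made up):

```python
import pandas as pd

zipcodes_df = pd.DataFrame({"Zipcode": [35207, 78703], "State": ["AL", "TX"]})
eats_df = pd.DataFrame({"zip_code": ["35207", "78703", "oops", None]})

# Build the zip -> state lookup once, keyed by 5-character strings
# (zfill pads short zips like Puerto Rico's 3-digit codes).
zip_to_state = dict(zip(zipcodes_df["Zipcode"].astype(str).str.zfill(5),
                        zipcodes_df["State"]))
# Normalize each zip to its first 5 characters, then map; misses become NaN.
eats_df["state_by_zip"] = eats_df["zip_code"].astype(str).str[:5].map(zip_to_state)
```

Non-numeric or missing zip codes simply fail the lookup and end up as NaN, matching the behavior of the loop above.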
We then viewed the data by state and noticed there were more unique states in the zip code data than in the data pulled by the regular expression.
# Add in updated state data
df["state_by_zip"] = row
# View count data by State
df.groupby("state_by_zip").count()
| id | position | name | rating | number_of_ratings | category | price_range | full_address | zip_code | lat | lng | State | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| state_by_zip | ||||||||||||
| AL | 1102 | 1102 | 1102 | 524 | 524 | 1101 | 944 | 1102 | 1102 | 1102 | 1102 | 1102 |
| AZ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| CA | 5 | 5 | 5 | 2 | 2 | 5 | 4 | 5 | 5 | 5 | 5 | 5 |
| CT | 5 | 5 | 5 | 3 | 3 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| DC | 1508 | 1508 | 1508 | 907 | 907 | 1508 | 1099 | 1508 | 1508 | 1508 | 1508 | 1508 |
| FL | 2 | 2 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| GA | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ID | 25 | 25 | 25 | 8 | 8 | 25 | 21 | 25 | 25 | 25 | 25 | 25 |
| IL | 208 | 208 | 208 | 127 | 127 | 208 | 188 | 208 | 208 | 208 | 208 | 208 |
| IN | 2 | 2 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| MA | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| MD | 897 | 897 | 897 | 619 | 619 | 897 | 713 | 897 | 897 | 897 | 897 | 897 |
| MN | 44 | 44 | 44 | 28 | 28 | 44 | 41 | 44 | 44 | 44 | 44 | 44 |
| MO | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| MT | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| NE | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| NH | 5 | 5 | 5 | 0 | 0 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| NJ | 2 | 2 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| NY | 5 | 5 | 5 | 1 | 1 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| OH | 17 | 17 | 17 | 4 | 4 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
| OR | 1023 | 1023 | 1023 | 522 | 522 | 1022 | 707 | 1023 | 1023 | 1023 | 1023 | 1023 |
| PA | 3 | 3 | 3 | 0 | 0 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| PR | 201 | 201 | 201 | 142 | 142 | 201 | 148 | 201 | 201 | 201 | 201 | 32 |
| SC | 2 | 2 | 2 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| TN | 42 | 42 | 42 | 6 | 6 | 42 | 39 | 42 | 42 | 42 | 42 | 42 |
| TX | 7259 | 7259 | 7259 | 4317 | 4317 | 7258 | 5813 | 7259 | 7259 | 7259 | 7259 | 7258 |
| UT | 3071 | 3071 | 3071 | 1597 | 1597 | 3071 | 2545 | 3071 | 3071 | 3071 | 3071 | 3071 |
| VA | 9248 | 9248 | 9248 | 5581 | 5581 | 9238 | 7966 | 9248 | 9248 | 9248 | 9248 | 9248 |
| VT | 347 | 347 | 347 | 79 | 79 | 347 | 329 | 347 | 347 | 347 | 347 | 347 |
| WA | 8883 | 8883 | 8883 | 5574 | 5574 | 8883 | 7264 | 8883 | 8883 | 8883 | 8883 | 8883 |
| WI | 4307 | 4307 | 4307 | 1681 | 1681 | 4307 | 3886 | 4307 | 4307 | 4307 | 4307 | 4307 |
| WV | 1373 | 1373 | 1373 | 322 | 322 | 1372 | 1322 | 1373 | 1373 | 1373 | 1373 | 1373 |
| WY | 320 | 320 | 320 | 75 | 75 | 320 | 307 | 320 | 320 | 320 | 320 | 320 |
This observation led to the discovery that some of the zip codes in the dataset were incorrect and did not correspond to the listed address. Fortunately, the state abbreviations listed in “full_address” were correct, so we used “state_by_zip” only to fill the missing entries that the regular expression did not fill, and made sure there weren’t many wrong entries.
# make new column
new_state = []
# add in the states found by zip code, but keep the original regex-derived data
for i in range(0, len(df['State'])):
    # the regex step stored None for missing states, so test for null
    if pd.isnull(df['State'][i]):
        new_state.append(df['state_by_zip'][i])
    else:
        new_state.append(df['State'][i])
df['State'] = new_state
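The fill-in-the-gaps loop above is equivalent to a single `fillna` call, shown here on a toy frame (the values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["AL", None, "TX"],
    "state_by_zip": ["AL", "WA", "NM"],  # NM deliberately disagrees with TX
})
# fillna keeps existing State values and only fills the nulls, so a wrong
# state_by_zip never overwrites a regex-derived state.
toy["State"] = toy["State"].fillna(toy["state_by_zip"])
```

Only the middle row changes; the disagreeing third row keeps its original regex-derived value, which is the behavior we want given that the addresses were more reliable than the zip codes.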
We then viewed the spread of the data by state.
# see the data spread per state
df.groupby("State").count()
| id | position | name | rating | number_of_ratings | category | price_range | full_address | zip_code | lat | lng | state_by_zip | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| State | ||||||||||||
| AL | 1107 | 1107 | 1107 | 526 | 526 | 1106 | 949 | 1107 | 1107 | 1107 | 1107 | 1107 |
| DC | 1511 | 1511 | 1511 | 907 | 907 | 1511 | 1100 | 1511 | 1511 | 1511 | 1511 | 1511 |
| ID | 27 | 27 | 27 | 8 | 8 | 27 | 23 | 27 | 27 | 27 | 27 | 27 |
| IL | 204 | 204 | 204 | 126 | 126 | 204 | 184 | 204 | 204 | 204 | 204 | 204 |
| MD | 895 | 895 | 895 | 619 | 619 | 895 | 712 | 895 | 895 | 895 | 895 | 895 |
| MN | 43 | 43 | 43 | 27 | 27 | 43 | 39 | 43 | 43 | 43 | 43 | 43 |
| OH | 15 | 15 | 15 | 3 | 3 | 15 | 15 | 15 | 15 | 15 | 15 | 15 |
| OR | 1024 | 1024 | 1024 | 523 | 523 | 1023 | 708 | 1024 | 1024 | 1024 | 1024 | 1023 |
| PR | 32 | 32 | 32 | 31 | 31 | 32 | 30 | 32 | 32 | 32 | 32 | 32 |
| TN | 42 | 42 | 42 | 6 | 6 | 42 | 39 | 42 | 42 | 42 | 42 | 42 |
| TX | 7272 | 7272 | 7272 | 4324 | 4324 | 7266 | 5820 | 7272 | 7267 | 7272 | 7272 | 7263 |
| UT | 3085 | 3085 | 3085 | 1603 | 1603 | 3085 | 2557 | 3085 | 3085 | 3085 | 3085 | 3075 |
| VA | 9264 | 9264 | 9264 | 5586 | 5586 | 9254 | 7983 | 9264 | 9264 | 9264 | 9264 | 9262 |
| VT | 347 | 347 | 347 | 79 | 79 | 347 | 329 | 347 | 347 | 347 | 347 | 347 |
| WA | 8895 | 8895 | 8895 | 5580 | 5580 | 8891 | 7270 | 8895 | 8891 | 8895 | 8895 | 8887 |
| WI | 4317 | 4317 | 4317 | 1686 | 1686 | 4317 | 3897 | 4317 | 4317 | 4317 | 4317 | 4311 |
| WV | 1379 | 1379 | 1379 | 322 | 322 | 1378 | 1328 | 1379 | 1379 | 1379 | 1379 | 1379 |
| WY | 319 | 319 | 319 | 75 | 75 | 319 | 306 | 319 | 319 | 319 | 319 | 319 |
Next, we will clean up the category data field.
df['category'].nunique()
10647
Counting the unique values of the category column gives 10,647 different categories. Looking back at the category list, we can see this is because the cuisines are stored as free-form strings that can pair many different keywords in many different orders. We want to clean this data so that it is usable in further analysis.
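To see why the count is so high, one can split the comma-separated strings into individual tags and count those instead. A sketch using two sample rows from the table above:

```python
import pandas as pd

cats = pd.Series([
    "Burgers, American, Sandwiches",
    "Pizza, American, Italian",
])
# Split on commas, flatten to one tag per row, and strip whitespace.
tags = cats.str.split(",").explode().str.strip()
n_unique_tags = tags.nunique()
```

Two category strings yield five distinct tags here; on the full column the number of distinct tags is far smaller than 10,647, confirming that the explosion comes from combinations and orderings rather than from genuinely different cuisines.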
First, let's take a look at the most common category names.
category_counts = df['category'].value_counts().to_frame()
category_counts.head(15)
| category | |
|---|---|
| Burgers, American, Sandwiches | 1619 |
| Mexican, Latin American, New Mexican | 1168 |
| Fast Food, Sandwich, American | 838 |
| Pizza, American, Italian | 714 |
| American, Burgers, Fast Food | 686 |
| American, Burgers, Sandwiches | 484 |
| Burritos, Fast Food, Mexican | 430 |
| Coffee and Tea, American, Breakfast and Brunch | 410 |
| Chinese, Asian, Asian Fusion | 366 |
| American, burger, Fast Food | 360 |
| Pharmacy, Convenience, Everyday Essentials, Baby | 326 |
| American, burger, Fast Food, Family Meals | 313 |
| Breakfast and Brunch, American, Sandwiches | 286 |
| Bakery, Breakfast and Brunch, Cafe, Coffee & Tea | 276 |
| Sandwiches, American, Healthy | 270 |
Looking at the top 15 categories, we can already see a lot of overlap in the types of food, just under different names and labels. For example, 'burgers' appears in 5 of the separate categories listed, since the strings can vary so much.
Because of this, we will do some cleaning to create more general categories that group restaurants of the same cuisine together.
To decide which general categories to sort restaurants into, we consulted a report from Uber on the most popular cuisines (found at https://www.uber.com/newsroom/the-2021-uber-eats-cravings-report/).
Since these are the most popular cuisines on the app, we figured that Uber Eats would have a relatively large number of these types of restaurants.
# changing category type to string so that we can do manipulations later
df['category'] = df['category'].astype(str)
# sort categories
df['gen_category'] = df['category'].apply(lambda x: 'American' if "burger" in x.lower() else
'Mexican' if "mexican" in x.lower() else
'Mexican' if "taco" in x.lower() else
'Mexican' if "burrito" in x.lower() else
'Chinese' if "chinese" in x.lower() else
'Indian' if "indian" in x.lower() else
'Pizza' if "pizza" in x.lower() else
'Japanese' if "sushi" in x.lower() else
'Japanese' if "japanese" in x.lower() else
'Thai' if "thai" in x.lower() else
'Mediterranean' if "mediterranean" in x.lower() else
'Breakfast' if "breakfast" in x.lower() else
'Breakfast' if "bagel" in x.lower() else
'Breakfast' if "donut" in x.lower() else
'Vietnamese' if "vietnamese" in x.lower() else
'American' if "american" in x.lower() else
'American' if "sandwich" in x.lower() else
'Convenience' if "convenience" in x.lower() else
'Korean' if 'korean' in x.lower() else
'Asian' if 'asian' in x.lower() else
'Italian' if 'italian' in x.lower() else
'Dessert' if 'dessert' in x.lower() else
'Dessert' if 'ice cream' in x.lower() else
'Vegetarian' if 'vegetarian' in x.lower() else
x)
To categorize the entries into more generic and overlapping categories, we looked for key words in the category string entries and then assigned them a general category name in the row gen_category. For example, if we found "burger" or "american" in the category, we assigned them to the "American" category.
We initially only made categories for the cuisines listed in the report, but when displaying the resulting categories and their counts (as shown in the table below), there were still overlapping category strings that could be combined. Because of this, we decided to add more categories to hold those types of stores; some examples of the ones added were Italian and Convenience.
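The long nested conditional above can also be written as an ordered keyword table, which is easier to extend when new categories like these come up. A sketch (the keyword list here is abbreviated; order matters because the first matching keyword wins, exactly as in the lambda):

```python
# Ordered keyword -> general category table (abbreviated for illustration).
KEYWORDS = [
    ("burger", "American"), ("mexican", "Mexican"), ("taco", "Mexican"),
    ("pizza", "Pizza"), ("sushi", "Japanese"), ("breakfast", "Breakfast"),
    ("american", "American"), ("sandwich", "American"),
]

def general_category(category):
    low = str(category).lower()
    # walk the table in order; return the first category whose keyword appears
    for keyword, name in KEYWORDS:
        if keyword in low:
            return name
    return None  # unmatched, like the np.nan fallback
```

Adding a new cuisine then means appending one tuple rather than threading another branch into the conditional.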
better_cat = df['gen_category'].value_counts().to_frame()
better_cat.head(20)
| gen_category | |
|---|---|
| American | 15531 |
| Mexican | 4058 |
| Pizza | 3936 |
| Breakfast | 3222 |
| Convenience | 1750 |
| Chinese | 1675 |
| Dessert | 1558 |
| Japanese | 1390 |
| Indian | 1011 |
| Mediterranean | 778 |
| Thai | 717 |
| Korean | 578 |
| Asian | 515 |
| Italian | 514 |
| Vietnamese | 417 |
| Vegetarian | 333 |
| Juice and Smoothies, Healthy, Fast Food | 78 |
| Retail, Gift Store, Beauty Supply | 75 |
| Alcohol, Liquor Stores, Wine | 64 |
| Halal, Chicken, Middle Eastern | 36 |
df['gen_category'].nunique()
986
For the entries that were not matched to a general category, we store NaN values to make data processing by category cleaner. We are left with 16 categories, as displayed below.
# sort categories
df['gen_category'] = df['category'].apply(lambda x: 'American' if "burger" in x.lower() else
'Mexican' if "mexican" in x.lower() else
'Mexican' if "taco" in x.lower() else
'Mexican' if "burrito" in x.lower() else
'Chinese' if "chinese" in x.lower() else
'Indian' if "indian" in x.lower() else
'Pizza' if "pizza" in x.lower() else
'Japanese' if "sushi" in x.lower() else
'Japanese' if "japanese" in x.lower() else
'Thai' if "thai" in x.lower() else
'Mediterranean' if "mediterranean" in x.lower() else
'Breakfast' if "breakfast" in x.lower() else
'Breakfast' if "bagel" in x.lower() else
'Breakfast' if "donut" in x.lower() else
'Vietnamese' if "vietnamese" in x.lower() else
'American' if "american" in x.lower() else
'American' if "sandwich" in x.lower() else
'Korean' if 'korean' in x.lower() else
'Asian' if 'asian' in x.lower() else
'Italian' if 'italian' in x.lower() else
'Dessert' if 'dessert' in x.lower() else
'Dessert' if 'ice cream' in x.lower() else
'Vegetarian' if 'vegetarian' in x.lower() else
'Convenience' if "convenience" in x.lower() else
np.nan)
better_cat = df['gen_category'].value_counts().to_frame()
better_cat.head(16)
| gen_category | |
|---|---|
| American | 15531 |
| Mexican | 4058 |
| Pizza | 3936 |
| Breakfast | 3222 |
| Convenience | 1720 |
| Chinese | 1675 |
| Dessert | 1581 |
| Japanese | 1390 |
| Indian | 1011 |
| Mediterranean | 778 |
| Thai | 717 |
| Korean | 580 |
| Asian | 517 |
| Italian | 514 |
| Vietnamese | 417 |
| Vegetarian | 336 |
Analysis of Ratings
Next, we want to take a look at the ratings of the restaurants and see what patterns can be uncovered. We first plotted the rating score versus the number of ratings. The scatter plot shows the data is skewed left: restaurants that have a lot of ratings tend to sit at the higher end of the rating scale. This was interesting on its own, but we then decided to check whether the pattern between ratings and number of ratings differs by price point.
# set size of plot
plt.figure(figsize=(12, 8))
font1 = {'size':20}
font2 = {'size':15}
#set the title
plt.title("Rating Score versus Number of Ratings", fontdict = font1)
# naming the x and y axis
plt.xlabel('Rating Score', fontdict = font2)
plt.ylabel('Number of Ratings', fontdict = font2)
# plot the scatter of rating vs number of ratings
plt.scatter(df.rating, df.number_of_ratings)
Before we split the price points up, we first wanted to see how many restaurants each price point contained. The data is sectioned off into 4 price points: $, $$, $$$, and $$$$. To make this a bit simpler to follow and talk about, we will refer to the price points as $: low, $$: medium, $$$: high, and $$$$: extremely high. From this data we see that over half of the restaurants in the dataset are in the low price point, the medium price point has fewer than half as many, and the high and extremely high price points have very few restaurants on Uber Eats. This pattern is not too surprising, since many people use these apps to get fast food delivered, and those restaurants would mostly fall at the lowest price point. Also, a restaurant with a higher price point is more likely to be a dine-in restaurant, and more people tend to go in person rather than use food delivery apps.
# Look at price range
df['price_range'].unique()
df['price_range'].value_counts()
$       24385
$$       9029
$$$       149
$$$$       18
Name: price_range, dtype: int64
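The "over half" claim can be checked directly with normalized counts; a small sketch reconstructing the shares from the output above:

```python
import pandas as pd

# Counts taken from the value_counts() output above.
counts = pd.Series({"$": 24385, "$$": 9029, "$$$": 149, "$$$$": 18})
shares = counts / counts.sum()
print(round(shares["$"], 3))  # prints 0.726
```

In the real notebook, `df['price_range'].value_counts(normalize=True)` produces the same shares in one call.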
Moving on, we section off the datasets by price points and then plot their respective rating scores versus number of ratings.
prices = df.groupby('price_range')
prices['rating'].count()
low = prices.get_group('$')
med = prices.get_group('$$')
high = prices.get_group('$$$')
top = prices.get_group('$$$$')
# plot rating vs number of ratings separately for each price range
groups = [(low, 'Low'), (med, 'Medium'), (high, 'High'), (top, 'Highest')]
font1 = {'size': 20}
font2 = {'size': 15}
for data, label in groups:
    # set size of plot
    plt.figure(figsize=(8, 6))
    # set the title
    plt.title("Rating Score versus Number of Ratings for " + label + " Price Range", fontdict=font1)
    # naming the x and y axis
    plt.xlabel('Rating Score', fontdict=font2)
    plt.ylabel('Number of Ratings', fontdict=font2)
    # plot the scatter of rating vs number of ratings
    plt.scatter(data.rating, data.number_of_ratings)
Looking at these plots sectioned by price point, the density of points drops sharply as the price point increases, which aligns with the restaurant counts per price point from before. The low, medium, and high plots share the same general left-skewed trend: they tend to have higher ratings and a higher number of ratings. The low price point has a denser cluster of points near the top right of the graph, meaning more highly rated restaurants with a greater number of ratings; this could be due to more people ordering from those restaurants and therefore more people writing reviews. There are far fewer restaurants in the high and extremely high plots, but the trend holds for the high plot. The extremely high price range plot does not show any clear pattern, since its points are scattered far apart.
Below is another representation of the same plots, overlaid by price point.
fig = plt.figure(figsize=(12, 8))
ax1 = fig.add_subplot(111)
ax1.scatter(low.rating, low.number_of_ratings, s=10, c='b', marker="s", label='low')
ax1.scatter(med.rating, med.number_of_ratings, s=10, c='r', marker="o", label='med')
ax1.scatter(high.rating, high.number_of_ratings, s=10, c='c', marker="s", label='high')
ax1.scatter(top.rating, top.number_of_ratings, s=10, c='m', marker="o", label='top')
plt.legend(loc='upper left')
plt.show()
Next, we decided to take a look at the best restaurants in the dataset, which we defined as the highest-rated restaurants that also have the highest number of ratings. We sorted the restaurants by those two columns and show the top 20 restaurants below.
# try to find the best rated restaurants by rating and number of ratings
highly_rated_restaurant = df.sort_values(['rating','number_of_ratings'], ascending=False)
highly_rated_restaurant.head(20)
| id | position | name | rating | number_of_ratings | category | price_range | full_address | zip_code | lat | lng | State | state_by_zip | gen_category | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18401 | 18402 | 68 | Starbucks (S. Van Dorn and Pickett) | 5.0 | 223.0 | Cafe, Coffee & Tea, Breakfast and Brunch, ... | $ | 5782 Dow Ave, Alexandria, VA, 22304 | 22304 | 38.804558 | -77.132929 | VA | VA | Breakfast |
| 28607 | 28608 | 169 | Sundevich | 5.0 | 176.0 | Salads, American, Vegetarian, Sandwich | NaN | 601 New Jersey Ave. NW, Washington, DC, 20001 | 20001 | 38.897830 | -77.011590 | DC | DC | American |
| 23134 | 23135 | 35 | Berries & Bowls | 5.0 | 156.0 | Juice and Smoothies, Healthy, Vegetarian | $ | 120 Market St, Gaithersburg, MD, 20878 | 20878 | 39.122270 | -77.234758 | MD | MD | Vegetarian |
| 22901 | 22902 | 15 | Starbucks (South Riding Blvd) | 5.0 | 137.0 | Cafe, Coffee & Tea, Breakfast and Brunch, ... | $ | 43114 Peacock Market #140, South Riding, VA, 2... | 20152 | 38.915668 | -77.511693 | VA | VA | Breakfast |
| 19926 | 19927 | 86 | Open Road (ROSSLYN) | 5.0 | 136.0 | Burgers, American, Sandwiches | $ | 1201 Wilson Boulevard, Arlington, VA, 22209 | 22209 | 38.895720 | -77.071040 | VA | VA | American |
| 35369 | 35370 | 10 | Cafe Vida (Rogers Ranch) | 5.0 | 114.0 | Breakfast and Brunch, Healthy, Latin American,... | $ | 2711 Treble Creek, San Antonio, TX, 78258 | 78258 | 29.604498 | -98.537205 | TX | TX | Breakfast |
| 11950 | 11951 | 241 | Banh Mi Up | 5.0 | 112.0 | Vietnamese, Noodles, Healthy | $ | 8037 N Lombard St, Portland, OR, 97203 | 97203 | 45.589600 | -122.748510 | OR | OR | Vietnamese |
| 19814 | 19815 | 14 | South Block (Falls Church) | 5.0 | 111.0 | Juice and Smoothies, Healthy, American | $ | 2121 N Westmoreland St, Arlington, VA, 22213 | 22213 | 38.886520 | -77.161690 | VA | VA | American |
| 9333 | 9334 | 3 | Teriyaki Plus | 5.0 | 110.0 | Japanese: Other, Asian, Sushi, Family Friendly | $ | 11512 124th Ave NE, Kirkland, WA, 98033 | 98033 | 47.703596 | -122.175306 | WA | WA | Japanese |
| 27298 | 27299 | 3 | Starbucks (9002 W. Broad Street) | 5.0 | 110.0 | Bakery, Cafe | $ | 9002 W. Broad Street, Richmond, VA, 23294 | 23294 | 37.635630 | -77.547270 | VA | VA | NaN |
| 12197 | 12198 | 65 | Kolby's Donut House | 5.0 | 108.0 | Bakery, Desserts, Sandwich | $ | 15012 Pacific Ave S, Tacoma, WA, 98444 | 98444 | 47.120297 | -122.435306 | WA | WA | American |
| 2105 | 2106 | 188 | Colectivo Prospect | 5.0 | 103.0 | Coffee and Tea, American, Breakfast and Brunch | $ | 2211 North Prospect Avenue, Milwaukee, WI, 53202 | 53202 | 43.059145 | -87.885167 | WI | WI | Breakfast |
| 26807 | 26808 | 65 | sweetgreen (West End) | 5.0 | 102.0 | Healthy, Salads | $ | 2238 M St NW, Washington, DC, 20037 | 20037 | 38.905052 | -77.049380 | DC | DC | NaN |
| 27964 | 27965 | 94 | Starbucks (Oakton) | 5.0 | 101.0 | Cafe, Coffee & Tea, Breakfast and Brunch, ... | $ | 2930 Chain Bridge Road, Oakton, VA, 22124 | 22124 | 38.882349 | -77.299989 | VA | VA | Breakfast |
| 30902 | 30903 | 21 | Ryan's Bagel Cafe | 5.0 | 99.0 | Breakfast and Brunch, American, Sandwiches, Ba... | NaN | 10261 South 1300 East, Sandy, UT, 84094 | 84094 | 40.565180 | -111.852960 | UT | UT | Breakfast |
| 22723 | 22724 | 138 | Starbucks (3347 M Street Nw) | 5.0 | 97.0 | Bakery, Breakfast and Brunch, Cafe, Coffee &am... | $ | 3347 M Street Nw, Washington, DC, 20007 | 20007 | 38.905250 | -77.067710 | DC | DC | Breakfast |
| 36352 | 36353 | 5 | Arcadia Wine & Spirits | 5.0 | 94.0 | Alcohol, Liquor Stores, Wine | $$ | 5626 E R L Thornton Fwy, Dallas, TX, 75223 | 75223 | 32.790724 | -96.745450 | TX | TX | NaN |
| 36753 | 36754 | 2 | Starbucks (I-45 & 336) | 5.0 | 92.0 | Bakery, Breakfast and Brunch, Cafe, Coffee &am... | $ | 1403 North Loop 336, Conroe, TX, 77304 | 77304 | 30.332637 | -95.479687 | TX | TX | Breakfast |
| 34444 | 34445 | 5 | Starbucks (Brownfield & Milwaukee) | 5.0 | 91.0 | Bakery, Breakfast and Brunch, Cafe, Coffee &am... | $ | 5014 Milwaukee, Lubbock, TX, 79407 | 79407 | 33.546354 | -101.957598 | TX | TX | Breakfast |
| 11204 | 11205 | 148 | Thai Pod Restaurant | 5.0 | 89.0 | Thai | $ | 2015 NE Broadway St, Portland, OR, 97212 | 97212 | 45.535164 | -122.645562 | OR | OR | Thai |
One thing that stood out to us was how many Breakfast and American category restaurants ranked among the highest. This is likely because these categories include chains like Starbucks, which many people frequent every day. Seven of the top 20 restaurants under this calculation are Starbucks locations, which also accounts for all of the repeated category types. The remaining categories that make it into the top 20 are Vietnamese, Japanese, and Thai. The stores that fall outside our general categorization are a Sweetgreen (salads) and a liquor store.
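As a quick sanity check on the chain counts above, we could tally how many of the top 20 belong to a given chain. This is only a sketch: `count_chain_in_top` is a hypothetical helper, and we assume the `name`, `rating`, and `number_of_ratings` columns of the main dataframe.

```python
import pandas as pd

# Hypothetical helper: count how many of the top-n restaurants (ranked
# by rating, then by number_of_ratings, both descending) belong to a
# given chain. Column names follow the main dataframe used above.
def count_chain_in_top(df, chain="Starbucks", n=20):
    top = df.sort_values(["rating", "number_of_ratings"],
                         ascending=False).head(n)
    return int(top["name"].str.contains(chain, case=False, na=False).sum())
```

Applied to the full dataframe, this reproduces the hand count of Starbucks stores in the ranking.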
Moving on, we take a look at some statistics by price range.
In the price_range column, each location is assigned a dollar-sign value that corresponds to how cheap or expensive the food there is. To observe how this data is distributed and how it relates to other factors like number of ratings and average rating, we created three plots.
df['price_range_number'] = df['price_range'].map({'$': 1, '$$': 2, '$$$': 3, '$$$$': 4})
The first is a bar graph showing the number of restaurants that fall under each price range ($, $$, $$$, $$$$). From it we can observe that most restaurants in the dataset are $ in price, while very few are $$$ or $$$$. The next is a box-and-whisker plot showing the relationship between each price category and the number of ratings. The median number of ratings is around 50 and very close across the price ranges, but $ and $$ have many outlier points. The third plot is also a box-and-whisker plot comparing ratings across the price ranges. Again, the medians are very similar, all around a 4.6 rating, and again there are many outliers at the $ and $$ price points.
fig, axes = plt.subplots(1,3, figsize=(15, 6))
fig.suptitle('Exploring Restaurant Prices')
sns.countplot(ax=axes[0], data=df, x='price_range_number', palette="Set1").set_title('The Price Categories')
sns.boxplot(ax=axes[1], data=df, x='price_range_number', y = 'number_of_ratings', palette="Set1").set_title('Number of Ratings by Price Category')
sns.boxplot(ax=axes[2], data=df, x='price_range_number', y = 'rating', palette="Set1").set_title('Average Rating by Price Category')
Text(0.5, 1.0, 'Average Rating by Price Category')
Category Analysis:
First, let's get the average rating by category. To do this, we group by category, take the mean of the ratings, and show the count of restaurants contributing to each mean. We also display the results in descending order so we can read them as a ranking.
df_by_cat_rating = df.groupby('gen_category')['rating'].agg(['mean', 'count'])
#print (df_by_cat_rating.sort_values(by=['mean'], ascending=False))
df_by_cat_rating.sort_values(by=['mean'], ascending=False, inplace=True)
print("Average Rating and Counts of Ratings for Each Category")
df_by_cat_rating.reset_index(inplace=True)
df_by_cat_rating.head(16)
Average Rating and Counts of Ratings for Each Category
| gen_category | mean | count | |
|---|---|---|---|
| 0 | Vietnamese | 4.708392 | 286 |
| 1 | Dessert | 4.707273 | 715 |
| 2 | Convenience | 4.705926 | 405 |
| 3 | Asian | 4.703976 | 327 |
| 4 | Vegetarian | 4.699338 | 151 |
| 5 | Thai | 4.690020 | 501 |
| 6 | Korean | 4.679143 | 350 |
| 7 | Japanese | 4.674190 | 988 |
| 8 | Breakfast | 4.653904 | 1870 |
| 9 | Mediterranean | 4.629263 | 475 |
| 10 | Pizza | 4.580267 | 2022 |
| 11 | Italian | 4.552778 | 252 |
| 12 | Chinese | 4.546014 | 1154 |
| 13 | Mexican | 4.522583 | 2400 |
| 14 | Indian | 4.516926 | 579 |
| 15 | American | 4.483449 | 8610 |
Vietnamese restaurants had the highest average rating, with Dessert and Convenience close behind in second and third. American had the lowest average rating despite having the most occurrences in the dataset. This may be because more occurrences mean more restaurants that can pull the average up or down, and in this case down; the wide variation in American restaurant quality may be a factor too. Although the same could be said about any category, the sheer number of American restaurants is worth keeping in mind. Generally, the categories with more ratings tend to be lower on the list, such as American, Mexican, Chinese, Pizza, and Breakfast.
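Since the per-category means weigh every restaurant equally, one way to account for the differing counts is to weight each restaurant's rating by its number of reviews. A minimal sketch, assuming the same `df` columns used above (`weighted_rating_by_category` is a hypothetical helper):

```python
import pandas as pd

# Sketch: category ratings weighted by each restaurant's
# number_of_ratings, so lightly-reviewed stores carry less weight
# than heavily-reviewed ones. Column names follow the main dataframe.
def weighted_rating_by_category(df):
    d = df.dropna(subset=["rating", "number_of_ratings"]).copy()
    d["weighted"] = d["rating"] * d["number_of_ratings"]
    g = d.groupby("gen_category")
    return (g["weighted"].sum()
            / g["number_of_ratings"].sum()).sort_values(ascending=False)
```

If a category's weighted mean falls well below its plain mean, its highly-reviewed restaurants are rated worse than its lightly-reviewed ones.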
Let’s also visualize the average rating by category.
plt.figure(figsize=(15,8))
sns.barplot(data=df_by_cat_rating, x="gen_category", y="mean")
plt.ylim(4, 5)
plt.xticks(rotation=90)
plt.title('Average Rating by Category')
plt.ylabel('Average Rating')
plt.xlabel('Category of Restaurant')
Text(0.5, 0, 'Category of Restaurant')
From this graph we can see that the average ratings of the categories are very close at the top and spread out as the ranking goes down. They are all still generally in the 4.4-4.7 range, so no one category sticks out like a sore thumb. Next, let's take a look at the count of restaurants by category at each price point.
We break them up into counts as seen below. For better readability, let's graph the count of each restaurant category at each price point.
print (df.groupby(['gen_category','price_range']).size().unstack(fill_value=0))
| gen_category | $ | $$ | $$$ | $$$$ |
|---|---|---|---|---|
| American | 9685 | 3871 | 66 | 6 |
| Asian | 199 | 93 | 0 | 2 |
| Breakfast | 1789 | 934 | 3 | 1 |
| Chinese | 866 | 479 | 4 | 0 |
| Convenience | 1643 | 12 | 0 | 0 |
| Dessert | 1098 | 219 | 13 | 2 |
| Indian | 430 | 208 | 8 | 1 |
| Italian | 208 | 215 | 7 | 0 |
| Japanese | 624 | 394 | 13 | 1 |
| Korean | 337 | 80 | 3 | 1 |
| Mediterranean | 380 | 177 | 5 | 0 |
| Mexican | 2473 | 682 | 6 | 0 |
| Pizza | 2804 | 658 | 10 | 1 |
| Thai | 306 | 240 | 2 | 0 |
| Vegetarian | 205 | 46 | 1 | 0 |
| Vietnamese | 237 | 55 | 0 | 0 |
df['price_range_str'] = df['price_range'].map({'$': 'Low', '$$': 'Medium', '$$$': 'High', '$$$$': 'Extremely High'})
cat_price_ranges = df.groupby(['gen_category','price_range_str']).size().unstack(fill_value=0)
cat_price_ranges.reset_index(inplace=True)
cat_price_ranges.head()
| price_range_str | gen_category | Extremely High | High | Low | Medium |
|---|---|---|---|---|---|
| 0 | American | 6 | 66 | 9685 | 3871 |
| 1 | Asian | 2 | 0 | 199 | 93 |
| 2 | Breakfast | 1 | 3 | 1789 | 934 |
| 3 | Chinese | 0 | 4 | 866 | 479 |
| 4 | Convenience | 0 | 0 | 1643 | 12 |
# one bar chart per price point; a loop avoids repeating the same block four times
for level in ['Low', 'Medium', 'High', 'Extremely High']:
    plt.figure(figsize=(15,8))
    sns.barplot(data=cat_price_ranges, x="gen_category", y=level)
    plt.xticks(rotation=90)
    plt.title(f'Number of Restaurants at {level} Price Point by Category')
    plt.ylabel('Number of Restaurants')
    plt.xlabel('Category of Restaurant')
Text(0.5, 0, 'Category of Restaurant')
Looking at these graphs, the ratio of restaurants across categories is roughly the same at the different price points. American restaurants always have the highest counts, while most of the other categories sit closer together.
The 2nd, 3rd, and 4th most common categories change at each price point. At the lowest price point, Pizza, Mexican, and Breakfast are 2nd, 3rd, and 4th respectively. At the medium price point, Breakfast now has the second-highest count, beating out Mexican and Pizza, which are 3rd and 4th. At the high price point, the 2nd through 4th spots shuffle again: Dessert is 2nd, Japanese 3rd, and Pizza 4th.
The extremely high price point only has restaurants from half of the categories (American, Asian, Breakfast, Dessert, Indian, Japanese, Korean, and Pizza), with the remaining eight categories having none.
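The "same ratio" observation can also be checked numerically by normalizing the counts within each price point so each column sums to 1. A sketch, assuming the `df` columns used above (`category_shares_by_price` is a hypothetical helper):

```python
import pandas as pd

# Sketch: share of each category within each price point. Normalizing
# by column turns raw counts into proportions, so the distribution of
# categories can be compared directly across price ranges.
def category_shares_by_price(df):
    return pd.crosstab(df["gen_category"], df["price_range"],
                       normalize="columns")
```

Calling `category_shares_by_price(df).round(3)` would show, for example, whether American's share of the $ column matches its share of the $$$$ column.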
Maps
Since we were provided the longitude and latitude of each restaurant, we were able to explore the data further in terms of location. To begin, we created a heat map showing where the restaurants are concentrated, which will help us examine those areas more closely. For this code, we use the folium library to create the heat map.
# creating a heat map to consider the locations our data is based from
locs = []
heatmap = folium.Map(location=[39.8283, -98.5795], zoom_start=4)
for i, loc in df.iterrows():
    if loc['lat'] != 0 and loc['lng'] != 0:
        locs.append((loc['lat'], loc['lng']))
# add_child replaces the deprecated add_children
heatmap.add_child(plugins.HeatMap(locs, radius=18))
From the heat map, we can see the data points concentrated in certain areas, and more specifically certain cities. For example, we see data points concentrated in Washington, Utah, Texas, the District of Columbia, Alabama, Ohio, Illinois, Vermont, Maryland, Virginia, Idaho, and Oregon. Some of the areas with the heaviest concentrations are Maryland, DC, Virginia, San Antonio, Seattle, Portland, Salt Lake City, and Burlington. After analyzing the concentrations, we wanted to visually display where our highest-priced restaurants are located, so we created a separate map with individual points instead. For this map, we iterated through the rows of our data frame and plotted the restaurants with the highest price range.
# create a map of where all the highest-priced restaurants are located
# center the map for the US
highest_map = folium.Map(location=[39.8283, -98.5795], zoom_start=4)
for index, row in df.iterrows():
if row["price_range"] == '$$$$':
folium.Marker(location=[row['lat'], row['lng']], popup= '$$$$', color='red',
icon=folium.Icon(color='red')).add_to(highest_map)
highest_map
From this map, we can now see the very few highest-price-range points displayed across the US. We noticed that most of these are in major cities such as Seattle, San Francisco, and DC. We decided to explore this even further with more mapping. The next few maps show data points concentrated around Seattle, San Antonio, and DC. For each map, we highlight the highest price range restaurants by leaving them in red and as markers instead of circles. For the restaurant entries that, unfortunately, have no price range data, we decided to still plot them, but in black.
# create a map of restaurants in Seattle, Washington with corresponding price ranges
# filter dataset to restaurants in Washington
wa = ['WA']
waDF = df.loc[df["State"].isin(wa)]
# center the map on Seattle
wa_map = folium.Map(location=[47.6511, -122.2401], zoom_start=9)
print(waDF['price_range'].unique())
for index, row in waDF.iterrows():
    if row["price_range"] == '$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']],
                                    color='darkblue').add_to(wa_map)
    elif row["price_range"] == '$$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']],
                                    color='darkpurple').add_to(wa_map)
    elif row["price_range"] == '$$$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']],
                                    color='green').add_to(wa_map)
    elif row["price_range"] == '$$$$':
        folium.Marker(location=[row['lat'], row['lng']], popup='$$$$',
                      icon=folium.Icon(color='red')).add_to(wa_map)
    else:
        # missing price_range is NaN, not the string 'nan'
        folium.vector_layers.Circle(location=[row['lat'], row['lng']],
                                    color='black').add_to(wa_map)
wa_map
['$' nan '$$' '$$$' '$$$$']
# create a map of restaurants in San Antonio, Texas with corresponding price ranges
# filter dataset to restaurants in Texas
tx = ['TX']
txDF = df.loc[df["State"].isin(tx)]
# center the map on San Antonio
tx_map = folium.Map(location=[29.5000, -98.4946], zoom_start=11)
print(txDF['price_range'].unique())
for index, row in txDF.iterrows():
    if row["price_range"] == '$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']],
                                    color='darkblue').add_to(tx_map)
    elif row["price_range"] == '$$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']],
                                    color='darkpurple').add_to(tx_map)
    elif row["price_range"] == '$$$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']],
                                    color='green').add_to(tx_map)
    elif row["price_range"] == '$$$$':
        folium.Marker(location=[row['lat'], row['lng']], popup='$$$$',
                      icon=folium.Icon(color='red')).add_to(tx_map)
    else:
        # missing price_range is NaN, not the string 'nan'
        folium.vector_layers.Circle(location=[row['lat'], row['lng']],
                                    color='black').add_to(tx_map)
tx_map
['$' nan '$$' '$$$' '$$$$']
# create a map of restaurants in DC with corresponding price ranges
# filter dataset to restaurants in DC
dc = ['DC']
dcDF = df.loc[df["State"].isin(dc)]
# center the map at Washington DC
pr_map = folium.Map(location=[38.9190, -77.0100], zoom_start=13)
print(dcDF['price_range'].unique())
for index, row in dcDF.iterrows():
    if row["price_range"] == '$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']],
                                    color='darkblue').add_to(pr_map)
    elif row["price_range"] == '$$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']],
                                    color='darkpurple').add_to(pr_map)
    elif row["price_range"] == '$$$':
        folium.vector_layers.Circle(location=[row['lat'], row['lng']],
                                    color='green').add_to(pr_map)
    elif row["price_range"] == '$$$$':
        folium.Marker(location=[row['lat'], row['lng']], popup='$$$$',
                      icon=folium.Icon(color='red')).add_to(pr_map)
    else:
        # missing price_range is NaN, not the string 'nan'
        folium.vector_layers.Circle(location=[row['lat'], row['lng']],
                                    color='black').add_to(pr_map)
pr_map
[nan '$' '$$' '$$$' '$$$$']
Something we can conclude from this data is that most of our entries are in the lower price ranges, even in the bigger cities. Another trend is that the Uber Eats data is concentrated around larger populations: the cities highlighted in the heat map are heavily populated. This could suggest that people in larger cities are more inclined to use a service like Uber Eats, but it might also say something about the regions the data is selective towards, making us question whether it was collected more heavily from these areas. Something else worth noting is the locations of the higher-priced restaurants. Although the data in general is concentrated in certain states, the highest price range restaurants are almost always within the borders of those states' big cities. This gets us thinking about the incomes of the people living in these areas, as well as their populations.
To get specific values for the price range descriptions we already have, we will calculate the average price of menus at each restaurant using the restaurant-menus file and then add that data point to our main dataframe.
df2 = pd.read_csv('/content/drive/MyDrive/restaurant-menus.csv')
df2.head()
| restaurant_id | category | name | description | price | |
|---|---|---|---|---|---|
| 0 | 1 | Extra Large Pizza | Extra Large Meat Lovers | Whole pie. | 15.99 USD |
| 1 | 1 | Extra Large Pizza | Extra Large Supreme | Whole pie. | 15.99 USD |
| 2 | 1 | Extra Large Pizza | Extra Large Pepperoni | Whole pie. | 14.99 USD |
| 3 | 1 | Extra Large Pizza | Extra Large BBQ Chicken & Bacon | Whole Pie | 15.99 USD |
| 4 | 1 | Extra Large Pizza | Extra Large 5 Cheese | Whole pie. | 14.99 USD |
def find_number(text):
    # extract decimal prices like "15.99" from strings such as "15.99 USD";
    # the dot must be escaped so it matches a literal decimal point
    num = re.findall(r'[0-9]+\.[0-9]+', text)
    return " ".join(num)
df2['avg_menu_price'] = df2['price'].apply(find_number)
df2['avg_menu_price'] = df2['avg_menu_price'].astype(float)
df2.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3375211 entries, 0 to 3375210 Data columns (total 6 columns): # Column Dtype --- ------ ----- 0 restaurant_id int64 1 category object 2 name object 3 description object 4 price object 5 avg_menu_price float64 dtypes: float64(1), int64(1), object(4) memory usage: 154.5+ MB
avg_price = df2.groupby('restaurant_id')['avg_menu_price'].mean().to_frame()
avg_price.reset_index(inplace=True)
avg_price.head()
| restaurant_id | avg_menu_price | |
|---|---|---|
| 0 | 1 | 5.663684 |
| 1 | 2 | 5.505333 |
| 2 | 3 | 10.762143 |
| 3 | 4 | 10.531892 |
| 4 | 5 | 4.532576 |
df = pd.merge(df, avg_price, left_on='id', right_on='restaurant_id')
df.head()
| id | position | name | rating | number_of_ratings | category | price_range | full_address | zip_code | lat | lng | State | state_by_zip | gen_category | price_range_number | price_range_str | restaurant_id | avg_menu_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 19 | PJ Fresh (224 Daniel Payne Drive) | NaN | NaN | Burgers, American, Sandwiches | $ | 224 Daniel Payne Drive, Birmingham, AL, 35207 | 35207 | 33.562365 | -86.830703 | AL | AL | American | 1.0 | Low | 1 | 5.663684 |
| 1 | 2 | 9 | J' ti`'z Smoothie-N-Coffee Bar | NaN | NaN | Coffee and Tea, Breakfast and Brunch, Bubble Tea | NaN | 1521 Pinson Valley Parkway, Birmingham, AL, 35217 | 35217 | 33.583640 | -86.773330 | AL | AL | Breakfast | NaN | NaN | 2 | 5.505333 |
| 2 | 3 | 6 | Philly Fresh Cheesesteaks (541-B Graymont Ave) | NaN | NaN | American, Cheesesteak, Sandwiches, Alcohol | $ | 541-B Graymont Ave, Birmingham, AL, 35204 | 35204 | 33.509800 | -86.854640 | AL | AL | American | 1.0 | Low | 3 | 10.762143 |
| 3 | 4 | 17 | Papa Murphy's (1580 Montgomery Highway) | NaN | NaN | Pizza | $ | 1580 Montgomery Highway, Hoover, AL, 35226 | 35226 | 33.404439 | -86.806614 | AL | AL | Pizza | 1.0 | Low | 4 | 10.531892 |
| 4 | 5 | 162 | Nelson Brothers Cafe (17th St N) | 4.7 | 22.0 | Breakfast and Brunch, Burgers, Sandwiches | NaN | 314 17th St N, Birmingham, AL, 35203 | 35203 | 33.514730 | -86.811700 | AL | AL | American | NaN | NaN | 5 | 4.532576 |
Since we saw a fairly specific pattern in where the higher-priced restaurants are located (in the midst of cities), we decided to explore how average income per zip code and/or city population correlate with the presence of the highest-priced restaurants in those cities. We found a CSV from a government website that includes average income by zip code, which fits perfectly with our restaurant data since it can also be split on zip code. We imported this data into our notebook. Next, we needed to merge the average income and total population columns into the main dataframe. To do this, we cleaned up the zip code columns so no special characters were present: a regex keeps only the digits, stored in a separate column. To make sure the column types matched, we converted the zip code columns to floats, since an integer dtype cannot hold NaN values. We then merged the two datasets on the zip code columns, adding the average income and total population columns to our main dataset.
income_df = pd.read_csv("/content/drive/MyDrive/postcode_level_averages.csv")
# .copy() avoids the SettingWithCopyWarning when we modify avg_income below
avg_income = income_df[["zipcode", "total_pop", "avg_income"]].copy()
avg_income.head()
df['zip_code'] = df['zip_code'].astype(str)
def find_number2(text):
    # keep only the runs of digits from the zip code string
    num = re.findall(r'[0-9]+', text)
    return " ".join(num)
df['zipcode clean'] = df['zip_code'].apply(lambda x: find_number2(x).split(' ', 1)[0])
df['zipcode clean'] = df['zipcode clean'].apply(lambda x: np.nan if x == '' else x)
df['zipcode clean'] = df['zipcode clean'].astype(float)
avg_income['zipcode'] = avg_income['zipcode'].astype(float)
df4 = pd.merge(df, avg_income, left_on='zipcode clean', right_on='zipcode')
df4.head()
| id | position | name | rating | number_of_ratings | category | price_range | full_address | zip_code | lat | ... | state_by_zip | gen_category | price_range_number | price_range_str | restaurant_id | avg_menu_price | zipcode clean | zipcode | total_pop | avg_income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 19 | PJ Fresh (224 Daniel Payne Drive) | NaN | NaN | Burgers, American, Sandwiches | $ | 224 Daniel Payne Drive, Birmingham, AL, 35207 | 35207 | 33.562365 | ... | AL | American | 1.0 | Low | 1 | 5.663684 | 35207.0 | 35207.0 | 3110 | 26956.913183 |
| 1 | 24 | 32 | Cinnabon baked at Flying J (224 Daniel Payne D... | NaN | NaN | Bakery, Desserts | $ | 224 Daniel Payne Drive, Birmingham, AL, 35207 | 35207 | 33.562260 | ... | AL | Dessert | 1.0 | Low | 24 | 6.706129 | 35207.0 | 35207.0 | 3110 | 26956.913183 |
| 2 | 164 | 59 | Cosmic Wings - 5 Points West | 3.5 | 13.0 | American, Bar Food, Wings, Fast Food, Chicken,... | $ | 2246 Bessemer Road, Birmingham, AL, 35207 | 35207 | 33.497590 | ... | AL | American | 1.0 | Low | 164 | 9.341111 | 35207.0 | 35207.0 | 3110 | 26956.913183 |
| 3 | 171 | 51 | Denny's (224 Daniel Payne Drive N) | 3.5 | 68.0 | American, Breakfast and Brunch, Coffee and Tea... | $$ | 224 Daniel Payne Drive N, Birmingham, AL, 35207 | 35207 | 33.562544 | ... | AL | Breakfast | 2.0 | Medium | 171 | 9.503614 | 35207.0 | 35207.0 | 3110 | 26956.913183 |
| 4 | 2 | 9 | J' ti`'z Smoothie-N-Coffee Bar | NaN | NaN | Coffee and Tea, Breakfast and Brunch, Bubble Tea | NaN | 1521 Pinson Valley Parkway, Birmingham, AL, 35217 | 35217 | 33.583640 | ... | AL | Breakfast | NaN | NaN | 2 | 5.505333 | 35217.0 | 35217.0 | 4900 | 35761.224490 |
5 rows × 22 columns
Hypothesis Testing
Now that we have the average menu price for the restaurants, the average income per zip code, and the total population per zip code, we have all the information we need to perform a linear regression with this information to explore the idea of it being correlated.
Null Hypothesis: Average income is not correlated with restaurant prices per zip code.
We first explore the correlation between average income and the price range number. Recall that each restaurant's price range was ranked from 1 to 4 (1 being the lowest priced and 4 the highest). The scatter plot below shows income on the x axis and the price range on the y axis.
#df4 = df4.drop('zipcode', axis=1)
# use average menu price and average income to get the linear regression model
df4 = df4.loc[df4["avg_income"].notna()]
df4 = df4.loc[df4["price_range_number"].notna()]
x = np.array(df4['avg_income']).reshape(-1,1)
y = np.array(df4['price_range_number'])
regression = linear_model.LinearRegression().fit(x, y)
# extract the slope and y intercept from the linear regression model
m = regression.coef_
b = regression.intercept_
# display information
print("Slope: ", m)
print("Y-intercept: ", b)
print("Linear Regression Model: ", m, "x", b)
Slope: [-1.55019176e-08] Y-intercept: 1.280730740137753 Linear Regression Model: [-1.55019176e-08] x 1.280730740137753
# create the basic scatter plot
plt.plot(x, y, 'o')
plt.plot(x, m*x+b, color = 'orange')
[<matplotlib.lines.Line2D at 0x7f547ca4c940>]
import statsmodels.api as sm
# do the linear regression
x = df4['avg_income']
y = df4['price_range_number']
x = sm.add_constant(x)
model = sm.OLS(y, x).fit()
predictions = model.predict(x)
#print out the model summary
print_model = model.summary()
print(print_model)
OLS Regression Results
==============================================================================
Dep. Variable: price_range_number R-squared: 0.000
Model: OLS Adj. R-squared: -0.000
Method: Least Squares F-statistic: 0.1107
Date: Fri, 16 Dec 2022 Prob (F-statistic): 0.739
Time: 20:23:56 Log-Likelihood: -21308.
No. Observations: 32947 AIC: 4.262e+04
Df Residuals: 32945 BIC: 4.264e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.2807 0.005 259.983 0.000 1.271 1.290
avg_income -1.55e-08 4.66e-08 -0.333 0.739 -1.07e-07 7.58e-08
==============================================================================
Omnibus: 5131.186 Durbin-Watson: 1.642
Prob(Omnibus): 0.000 Jarque-Bera (JB): 7969.906
Skew: 1.201 Prob(JB): 0.00
Kurtosis: 3.200 Cond. No. 2.05e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.05e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
#df4 = df4.drop('zipcode', axis=1)
# use average menu price and average income to get the linear regression model
x1 = np.array(df4['avg_income']).reshape(-1,1)
y1 = np.array(df4['avg_menu_price'])
regression = linear_model.LinearRegression().fit(x1, y1)
# extract the slope and y intercept from the linear regression model
m1 = regression.coef_
b1 = regression.intercept_
# display information
print("Slope: ", m1)
print("Y-intercept: ", b1)
print("Linear Regression Model: ", m1, "x", b1)
Slope: [1.06466071e-05] Y-intercept: 8.949484155637846 Linear Regression Model: [1.06466071e-05] x 8.949484155637846
# create the basic scatter plot
plt.plot(x1, y1, 'o')
plt.plot(x1, m1*x1+b1, color = 'orange')
[<matplotlib.lines.Line2D at 0x7f5477e94520>]
import statsmodels.api as sm
# do the linear regression
x1 = df4['avg_income']
y1 = df4['avg_menu_price']
x1 = sm.add_constant(x1)
model = sm.OLS(y1, x1).fit()
predictions = model.predict(x1)
#print out the model summary
print_model = model.summary()
print(print_model)
OLS Regression Results
==============================================================================
Dep. Variable: avg_menu_price R-squared: 0.008
Model: OLS Adj. R-squared: 0.008
Method: Least Squares F-statistic: 268.6
Date: Fri, 16 Dec 2022 Prob (F-statistic): 4.06e-60
Time: 20:23:56 Log-Likelihood: -1.0813e+05
No. Observations: 32947 AIC: 2.163e+05
Df Residuals: 32945 BIC: 2.163e+05
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 8.9495 0.069 130.282 0.000 8.815 9.084
avg_income 1.065e-05 6.5e-07 16.388 0.000 9.37e-06 1.19e-05
==============================================================================
Omnibus: 35430.126 Durbin-Watson: 1.886
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4542504.557
Skew: 5.292 Prob(JB): 0.00
Kurtosis: 59.542 Cond. No. 2.05e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.05e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Analysis
As we can see, there is little to no correlation between average income and price_range_number. The slope we calculated is extremely small and negative (-1.55e-08). To make a decision about our null hypothesis from these x and y values, we looked at the p-value of the slope coefficient, which is 0.739. Since p is not less than .05, we cannot reject our null hypothesis based on the relationship between these two variables.
But since we had averaged out the menu prices per restaurant, we decided to see whether that would change how these characteristics correlate. For this next scatter plot, we use average income as the x axis and average menu price as the y axis.
This produced a linear regression with a slope of 1.06e-05, showing a stronger relationship than the previous regression and allowing us to move forward with our analysis. Using the same x and y values, we calculated the p-value, which is reported as 0.000 (i.e., p < 0.001). Since this p-value is less than .05, we are able to reject the null hypothesis that average income and average restaurant prices are uncorrelated.
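The regression result can be cross-checked with a direct Pearson correlation test on the same two columns, which reports its own p-value. A sketch assuming the merged `df4` built above (`income_price_correlation` is a hypothetical helper):

```python
import pandas as pd
from scipy import stats

# Sketch: Pearson correlation between average income and average menu
# price, dropping rows missing either value first. Returns (r, p).
def income_price_correlation(df4):
    d = df4.dropna(subset=["avg_income", "avg_menu_price"])
    return stats.pearsonr(d["avg_income"], d["avg_menu_price"])
```

A small r with a tiny p-value would match the OLS result: a statistically significant but weak relationship (R² of 0.008).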
Moving on to classification, we wanted to see how well the average income and population for each zip code could predict whether a restaurant's price is higher or lower. Since we needed a binary target, we tackled this first. We split the four price ranges we had already extracted (price_range) into a higher/lower flag by adding another column to our dataframe called is_higher. If the price range is one of the lower two, the value is 'No'; if it is one of the higher two, it is 'Yes'.
# make another column called 'is_higher': 'Yes' for $$$ and $$$$, else 'No'
df4["is_higher"] = np.where(df4['price_range'].isin(['$$$', '$$$$']), 'Yes', 'No')
Using this, we perform a K-NN classification and a random forest classification. We use average income and population as the predicting features, and the target is is_higher (essentially predicting whether a restaurant is higher priced based on the population and average income of its location). We train on a portion of the data and use the remaining portion to test how accurately the model learned. For both classifiers, the data is first split into training and test sets. The K-nearest-neighbors and random forest algorithms are first fit to the training set and then predict, on the test set, whether each restaurant is higher priced or not. We used 10-fold cross-validation to measure how accurately each algorithm predicts the correct class (Yes/No).
# Classification
from sklearn.model_selection import cross_val_score
from scipy import stats
# prepare data into dataframes
data = df4[['total_pop', 'avg_income']].copy()
target = df4[['is_higher']].copy()
# K-NN Classification
x_train, x_test, y_train, y_test = train_test_split(data, target)
knn = KNeighborsClassifier(n_neighbors=5)
# ravel() flattens the one-column target to avoid DataConversionWarning
knn.fit(x_train, y_train.values.ravel())
knn.score(x_test, y_test)
# 10-fold cross validation
cvs = cross_val_score(knn, data, target.values.ravel(), cv=10)
# average accuracy across all the splits
print("K-NN Classification Average Accuracy: " + str(cvs.mean()))
print("K-NN Classification Standard Error: " + str(stats.sem(cvs)))
K-NN Classification Average Accuracy: 0.9733834727784826
K-NN Classification Standard Error: 0.011979384877157192
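The DataConversionWarning above appears because a single-column DataFrame is 2-D, while scikit-learn estimators expect a 1-D target of shape (n_samples,). A minimal sketch of the fix, using a hypothetical one-column target frame (not our actual data):

```python
import numpy as np
import pandas as pd

# Hypothetical single-column target frame, shape (4, 1) -- a "column vector"
target = pd.DataFrame({"high_price": [0, 1, 1, 0]})
print(target.shape)  # (4, 1) -- this shape triggers the warning

# np.ravel() flattens it to the 1-D shape (n_samples,) that sklearn expects
y = np.ravel(target)
print(y.shape)  # (4,)
```

Passing `np.ravel(y_train)` (or `y_train.values.ravel()`) into `fit` silences the warning without changing the fitted model.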
# Random Forest Classification
from sklearn.model_selection import cross_val_score
from scipy import stats

x_train, x_test, y_train, y_test = train_test_split(data, target)
rfc = RandomForestClassifier(n_estimators=100)
# np.ravel flattens the column-vector target to the 1-D shape sklearn expects,
# which silences the DataConversionWarning
rfc.fit(x_train, np.ravel(y_train))
rfc.predict(x_test)
rfc.score(x_test, y_test)
# 10-fold cross validation
cvs2 = cross_val_score(rfc, data, np.ravel(target), cv=10)
# average accuracy and standard error across all the splits
print("Random Forest Classification Average Accuracy: " + str(cvs2.mean()))
print("Random Forest Classification Standard Error: " + str(stats.sem(cvs2)))
<ipython-input-56-872101601dfa>:4: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel(). rfc.fit(x_train, y_train)
Random Forest Classification Average Accuracy: 0.8999370446841777
Random Forest Classification Standard Error: 0.036725425918588524
Analysis
We can conclude that, with the parameters we used, K-NN classification had better prediction performance than random forest classification: K-NN was accurate 97.33% of the time on average, while random forest was accurate 89.99% of the time on average. Overall, because both classifiers predicted relatively well, population and average income per zip code did a relatively good job of predicting whether a restaurant's price range falls on the higher end.
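The comparison above can be summarized numerically: given each model's per-fold accuracies, the mean and standard error show whether the gap between models is larger than fold-to-fold noise. A sketch with illustrative, made-up fold scores (not our actual cross-validation output):

```python
import numpy as np
from scipy import stats

# Illustrative 10-fold accuracy scores (hypothetical, for demonstration only)
knn_scores = np.array([0.97, 0.98, 0.96, 0.97, 0.99, 0.97, 0.98, 0.96, 0.97, 0.98])
rf_scores  = np.array([0.90, 0.88, 0.92, 0.89, 0.91, 0.87, 0.93, 0.90, 0.88, 0.91])

# Mean accuracy and standard error of the mean for each classifier
for name, s in [("K-NN", knn_scores), ("Random Forest", rf_scores)]:
    print(f"{name}: mean={s.mean():.4f}, SEM={stats.sem(s):.4f}")

# The gap in means is meaningful when it is several standard errors wide
diff = knn_scores.mean() - rf_scores.mean()
print(f"mean difference: {diff:.4f}")
```

With our real scores, the difference of roughly 0.073 is much larger than either standard error (0.012 and 0.037), which is why we favor K-NN here.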
Conclusion
Based on our analysis of the Uber Eats data, we can state a few insights. Comparing restaurants priced on different scales was difficult, so we wanted to explore some of the reasons behind the distribution of higher-priced restaurants. We thought income in an area might be a significant factor in where higher-priced restaurants are located compared to lower-priced ones, but it turns out that the correlation is minimal. Once we calculated the average price of each restaurant ourselves, instead of relying on the price range the data set already provided, we could reject the null hypothesis that these characteristics were uncorrelated. We also saw that using population in addition to average income per area worked relatively well in training our models to predict whether a restaurant was higher or lower priced. This shows that using both predictors yields a stronger relationship with price than using average income alone. Overall, the Uber Eats data provided us with the distribution, ratings, and prices of popular restaurants in many major cities throughout the United States, and we were able to use this information to explore factors that might influence the pricing and location of these restaurants, such as average income and population per zip code.